Statistical Analysis of AI Salaries (2025) 🤖

Using an AI jobs dataset


Author: Tayfur Akkaya Clavijo
Project: Statistical Analysis & Salary Modeling in the AI Job Market (2025)


Analysis objective:
The purpose of this study is to model and predict salaries for professionals in the AI job market, with statistical and scientific rigor.

Context: Final project for an advanced Statistics for Data Science course.

Table of Contents¶

  1. Data Loading and Setup
  2. Exploratory Data Analysis (EDA)
    2.1 Initial Review
    2.2 Visualizing Variable Distributions
    2.2.1 Target Variable
    2.2.2 Predictor Variables
  3. Outliers
    3.1 Identification
    3.2 Treatment
  4. Statistical Inference
    4.1 Normality Assessment
    4.2 Homoscedasticity Assessment (Equal Variances)
    4.3 Overall Analysis: Kruskal–Wallis
    4.4 Specific Analysis (Post-hoc): Quantifying Differences (Bootstrapping)
  5. Correlation and Association Analysis
  6. Predictive Modeling: Multiple Linear Regression
    6.1 Variable Preparation: Dummy Encoding
    6.2 Model Selection: Linear vs. Logarithmic
    6.3 Detailed Interpretation of Statistical Results
    6.4 Prediction and Confidence Intervals
    6.5 3D Visualization (Error Visualization)
    6.6 Model Uncertainty Analysis
  7. Final Conclusions

1. Data Loading and Setup¶


In [1]:
# Imports
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from scipy import stats
import statsmodels.api as sm
from scipy.stats import bootstrap
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from tqdm import tqdm
import kagglehub
import os

# Aesthetic configuration for plots
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

# Download the latest version of the dataset
path = kagglehub.dataset_download("pratyushpuri/global-ai-job-market-trend-2025")

print("Path to dataset files:", path)

# Exact name of the CSV file
csv_file_name = "ai_job_dataset.csv"

# Combine the folder path with the CSV file name
full_file_path = os.path.join(path, csv_file_name)

# Ensure that the CSV file has been downloaded and loaded correctly
try:
    # Read the CSV file using the full path
    # Apply 'parse_dates' to the date columns
    df = pd.read_csv(
        full_file_path,
        parse_dates=['posting_date', 'application_deadline']
    )
    print(
        f"✅ Dataset '{csv_file_name}' successfully loaded: "
        f"{df.shape[0]} rows, {df.shape[1]} columns."
    )
except FileNotFoundError:
    print(
        f"⚠️ ERROR: '{csv_file_name}' not found. "
        "Please make sure the download process completed correctly."
    )
Path to dataset files: C:\Users\tayfu\.cache\kagglehub\datasets\pratyushpuri\global-ai-job-market-trend-2025\versions\1
✅ Dataset 'ai_job_dataset.csv' successfully loaded: 15000 rows, 19 columns.
In [2]:
import plotly.io as pio
# This option forces Plotly to be saved as a full HTML block inside the notebook file
pio.renderers.default = "notebook"

2. Exploratory Data Analysis (EDA)¶


Our first goal is to understand the data. To do so, we will build a data dictionary based on our observations and by visualizing the distribution of the variables.

| Field Name | Description | Data Type | Format / Example | Possible Values / Range | Nulls Allowed | Notes | Semantic Type |
|---|---|---|---|---|---|---|---|
| job_id | Unique identifier for each job posting | Categorical (string) | 'AI00001' | N/A | No | Primary key, no duplicates. | Categorical |
| job_title | Job title or role advertised | Categorical (string) | 'AI Research Scientist' | 20 unique values | No | Includes all possible roles. | Categorical |
| salary_usd | Annual salary converted to USD | Integer (int) | 90,376 | ≥ 0 | No | Salary expressed in US dollars. | Ratio |
| salary_currency | Currency in which the salary is offered | Categorical (string) | 'USD' | 'USD', 'EUR', 'GBP' | No | USD = US Dollar, EUR = Euro, GBP = Pound Sterling. | Categorical |
| experience_level | Required level of professional experience | Categorical (string) | 'SE' | 'EN', 'MI', 'SE', 'EX' | No | EN = Entry, MI = Mid, SE = Senior, EX = Executive. | Categorical / Ordinal |
| employment_type | Type of employment contract | Categorical (string) | 'CT' | 'FT', 'PT', 'CT', 'FL' | No | CT = Contract, FT = Full-time, PT = Part-time, FL = Freelance. | Categorical |
| company_location | Country where the company is based | Categorical (string) | 'China' | 20 unique values | No | 20 unique country values. | Categorical |
| company_size | Size of the company | Categorical (string) | 'S' | 3 unique values | No | S = Small, M = Medium, L = Large. | Categorical / Ordinal |
| employee_residence | Country of employee residence | Categorical (string) | 'China' | 20 unique values | No | 20 unique country values. | Categorical |
| remote_ratio | Percentage of remote work allowed | Integer (int) | 50 | {0, 50, 100} | No | 0 = On-site, 50 = Hybrid, 100 = Fully remote. | Ratio |
| required_skills | Key technical skills required (comma-separated) | Categorical (string) | 'Python' | 13,663 unique values | No | Skills are comma-separated. | Categorical |
| education_required | Minimum education level required | Categorical (string) | 'Associate' | 'Associate', 'Bachelor', 'Master', 'PhD' | No | 4 unique education levels. | Categorical / Ordinal |
| years_experience | Minimum years of experience required | Integer (int) | 4 | 0 – 20 | No | Discrete numeric values. | Ratio |
| industry | Industry sector of the job | Categorical (string) | 'Media' | 15 unique values | No | 15 unique industry sectors. | Categorical |
| posting_date | Date when the job was posted | Date | '2024-10-18' | N/A | No | ISO format: YYYY-MM-DD. | Interval |
| application_deadline | Final application deadline | Date | '2024-11-07' | N/A | No | ISO format: YYYY-MM-DD. | Interval |
| job_description_length | Length of the job description (characters) | Integer (int) | 1,076 | N/A | No | Number of characters in the description text. | Ratio |
| benefits_score | Numeric score reflecting benefits quality | Float | 9.4 | 0 – 10 | No | Higher values indicate better benefits. | Ratio |
| company_name | Name of the hiring company | Categorical (string) | 'Smart Analytics' | 16 unique values | No | 16 unique company names. | Categorical |

2.1 Initial Review¶

In [3]:
# Quick overview of the DataFrame
df.head()
Out[3]:
job_id job_title salary_usd salary_currency experience_level employment_type company_location company_size employee_residence remote_ratio required_skills education_required years_experience industry posting_date application_deadline job_description_length benefits_score company_name
0 AI00001 AI Research Scientist 90376 USD SE CT China M China 50 Tableau, PyTorch, Kubernetes, Linux, NLP Bachelor 9 Automotive 2024-10-18 2024-11-07 1076 5.9 Smart Analytics
1 AI00002 AI Software Engineer 61895 USD EN CT Canada M Ireland 100 Deep Learning, AWS, Mathematics, Python, Docker Master 1 Media 2024-11-20 2025-01-11 1268 5.2 TechCorp Inc
2 AI00003 AI Specialist 152626 USD MI FL Switzerland L South Korea 0 Kubernetes, Deep Learning, Java, Hadoop, NLP Associate 2 Education 2025-03-18 2025-04-07 1974 9.4 Autonomous Tech
3 AI00004 NLP Engineer 80215 USD SE FL India M India 50 Scala, SQL, Linux, Python PhD 7 Consulting 2024-12-23 2025-02-24 1345 8.6 Future Systems
4 AI00005 AI Consultant 54624 EUR EN PT France S Singapore 100 MLOps, Java, Tableau, Python Master 0 Media 2025-04-15 2025-06-23 1989 6.6 Advanced Robotics

We can extract the number of rows and columns with shape:

In [4]:
rows, cols = df.shape
print(f"The DataFrame contains {rows} rows and {cols} columns")
The DataFrame contains 15000 rows and 19 columns

We run info():

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   job_id                  15000 non-null  object        
 1   job_title               15000 non-null  object        
 2   salary_usd              15000 non-null  int64         
 3   salary_currency         15000 non-null  object        
 4   experience_level        15000 non-null  object        
 5   employment_type         15000 non-null  object        
 6   company_location        15000 non-null  object        
 7   company_size            15000 non-null  object        
 8   employee_residence      15000 non-null  object        
 9   remote_ratio            15000 non-null  int64         
 10  required_skills         15000 non-null  object        
 11  education_required      15000 non-null  object        
 12  years_experience        15000 non-null  int64         
 13  industry                15000 non-null  object        
 14  posting_date            15000 non-null  datetime64[ns]
 15  application_deadline    15000 non-null  datetime64[ns]
 16  job_description_length  15000 non-null  int64         
 17  benefits_score          15000 non-null  float64       
 18  company_name            15000 non-null  object        
dtypes: datetime64[ns](2), float64(1), int64(4), object(12)
memory usage: 2.2+ MB

df.info() only prints its summary to the screen and does not return a manipulable object (it returns None), so its output cannot be processed programmatically. To work around this, we will build our own info_df(), which returns the same kind of summary as a DataFrame we can work with:

In [6]:
def info_df(df):
    return pd.DataFrame({
        'Column': df.columns,
        'Non-Null': df.notnull().sum().values,
        'Null': df.isnull().sum().values,
        'Python Type': df.dtypes.values,
        # dropna=False so the count matches len(unique()), nulls included
        'Num. Unique Values': [df[col].nunique(dropna=False) for col in df.columns],
        'Sample Values': [df[col].unique()[:5] for col in df.columns],
    })

info = info_df(df)
info
Out[6]:
Column Non-Null Null Python Type Num. Unique Values Sample Values
0 job_id 15000 0 object 15000 [AI00001, AI00002, AI00003, AI00004, AI00005]
1 job_title 15000 0 object 20 [AI Research Scientist, AI Software Engineer, ...
2 salary_usd 15000 0 int64 14315 [90376, 61895, 152626, 80215, 54624]
3 salary_currency 15000 0 object 3 [USD, EUR, GBP]
4 experience_level 15000 0 object 4 [SE, EN, MI, EX]
5 employment_type 15000 0 object 4 [CT, FL, PT, FT]
6 company_location 15000 0 object 20 [China, Canada, Switzerland, India, France]
7 company_size 15000 0 object 3 [M, L, S]
8 employee_residence 15000 0 object 20 [China, Ireland, South Korea, India, Singapore]
9 remote_ratio 15000 0 int64 3 [50, 100, 0]
10 required_skills 15000 0 object 13663 [Tableau, PyTorch, Kubernetes, Linux, NLP, Dee...
11 education_required 15000 0 object 4 [Bachelor, Master, Associate, PhD]
12 years_experience 15000 0 int64 20 [9, 1, 2, 7, 0]
13 industry 15000 0 object 15 [Automotive, Media, Education, Consulting, Hea...
14 posting_date 15000 0 datetime64[ns] 486 [2024-10-18 00:00:00, 2024-11-20 00:00:00, 202...
15 application_deadline 15000 0 datetime64[ns] 543 [2024-11-07 00:00:00, 2025-01-11 00:00:00, 202...
16 job_description_length 15000 0 int64 2000 [1076, 1268, 1974, 1345, 1989]
17 benefits_score 15000 0 float64 51 [5.9, 5.2, 9.4, 8.6, 6.6]
18 company_name 15000 0 object 16 [Smart Analytics, TechCorp Inc, Autonomous Tec...
In [7]:
df.describe(percentiles=[0.25, 0.5, 0.75, 0.95, 0.99]).T
Out[7]:
count mean min 25% 50% 75% 95% 99% max std
salary_usd 15000.0 115348.965133 32519.0 70179.75 99705.0 146408.5 237987.95 307775.69 399095.0 60260.940438
remote_ratio 15000.0 49.483333 0.0 0.0 50.0 100.0 100.0 100.0 100.0 40.812712
years_experience 15000.0 6.2532 0.0 2.0 5.0 10.0 17.0 19.0 19.0 5.545768
posting_date 15000 2024-08-29 08:48:51.840000 2024-01-01 00:00:00 2024-04-29 00:00:00 2024-08-28 00:00:00 2024-12-29 00:00:00 2025-04-06 00:00:00 2025-04-26 00:00:00 2025-04-30 00:00:00 NaN
application_deadline 15000 2024-10-11 21:55:23.520000 2024-01-16 00:00:00 2024-06-13 00:00:00 2024-10-12 00:00:00 2025-02-10 00:00:00 2025-05-21 00:00:00 2025-06-19 00:00:00 2025-07-11 00:00:00 NaN
job_description_length 15000.0 1503.314733 500.0 1003.75 1512.0 2000.0 2402.0 2479.0 2499.0 576.127083
benefits_score 15000.0 7.504273 5.0 6.2 7.5 8.8 9.8 9.9 10.0 1.45087

Note the gap between the 99th percentile of salary_usd (≈ 307,776) and its maximum (399,095): a first warning sign of outliers (we will revisit this later).

In [8]:
df.describe(include="O").T
Out[8]:
count unique top freq
job_id 15000 15000 AI00001 1
job_title 15000 20 Machine Learning Researcher 808
salary_currency 15000 3 USD 11957
experience_level 15000 4 MI 3781
employment_type 15000 4 FT 3812
company_location 15000 20 Germany 814
company_size 15000 3 S 5007
employee_residence 15000 20 Sweden 790
required_skills 15000 13663 Python, TensorFlow, PyTorch 17
education_required 15000 4 Bachelor 3789
industry 15000 15 Retail 1063
company_name 15000 16 TechCorp Inc 980

Some observations we can already see are:

  • Top job title: Machine Learning Researcher
  • Top salary currency: USD
  • Top experience level: MI — Mid level
  • Top employment type: FT — Full time
  • Top company size: S — Small
  • Top employee residence: Sweden
  • Top required skills (raw field): Python, TensorFlow, PyTorch
  • Top education required: Bachelor's
  • Top industry: Retail

In required_skills, the top value contains multiple skills. This happens because value_counts() treats each comma-separated combination as a single value, so it reports the most frequent combination, which is not the same as the most demanded individual skills.

Therefore, we will explode the skills list to get a clearer picture of which skills are most requested—an interesting complement to the analysis.

In [9]:
# Convert NaN to an empty string (we don't have any, but we include this as a best practice)
skills_series = df['required_skills'].fillna('')

# Split by commas -> converts each row into a list
skills_lists = skills_series.str.split(',')

# Strip whitespace around each skill
skills_lists = skills_lists.apply(lambda lst: [s.strip() for s in lst if s.strip() != ''])

# Explode -> each skill becomes its own row
skills_exploded = skills_lists.explode()

# Count frequencies
top_skills = skills_exploded.value_counts().head(10)

print("Top most in-demand skills:")
print(top_skills)
Top most in-demand skills:
required_skills
Python        4450
SQL           3407
TensorFlow    3022
Kubernetes    3009
Scala         2794
PyTorch       2777
Linux         2705
Git           2631
Java          2578
GCP           2442
Name: count, dtype: int64

We see very relevant, widely used skills in this industry: first Python, then SQL, and third TensorFlow—highly relevant for the sector.
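For reference, the explode-and-count pipeline above can be condensed into a single method chain. The skills Series below is a toy stand-in for df['required_skills'] so the sketch stays self-contained:

```python
import pandas as pd

# Toy stand-in for df['required_skills'] (comma-separated skill strings)
skills = pd.Series([
    'Python, SQL',
    'Python, TensorFlow',
    ' SQL ,Python',
    'TensorFlow',
])

# Same logic as the cell above, written as one chain:
# split -> explode -> strip -> drop empties -> count
top_skills = (
    skills.fillna('')
          .str.split(',')
          .explode()
          .str.strip()
          .loc[lambda s: s != '']
          .value_counts()
)
print(top_skills)
```

Applied to the real column, this chain reproduces the top-skills counts shown above.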

Now, as a best practice, we will convert ordinal columns to ordered categories in Python (ordered=True) and categorical columns to unordered categories (ordered=False) (except job_id, job_title, and required_skills, which we keep as objects for convenience).

In [10]:
from pandas.api.types import CategoricalDtype

# Define the logical ordering for each ordinal variable
size_order = CategoricalDtype(categories=['S', 'M', 'L'], ordered=True)

# Assume this order: Entry -> Mid -> Senior -> Executive
exp_order = CategoricalDtype(categories=['EN', 'MI', 'SE', 'EX'], ordered=True)

# Assume this academic order
edu_order = CategoricalDtype(categories=['Associate', 'Bachelor', 'Master', 'PhD'], ordered=True)

# Apply the transformations
df['company_size'] = df['company_size'].astype(size_order)
df['experience_level'] = df['experience_level'].astype(exp_order)
df['education_required'] = df['education_required'].astype(edu_order)
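As a quick illustration of what ordered=True enables, here is a minimal sketch on a toy Series that reuses the same experience-level codes (it does not touch df):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same ordering as in the dataset: Entry -> Mid -> Senior -> Executive
exp_order = CategoricalDtype(categories=['EN', 'MI', 'SE', 'EX'], ordered=True)
s = pd.Series(['SE', 'EN', 'EX', 'MI'], dtype=exp_order)

# Comparisons follow the declared order, not alphabetical order
print((s >= 'SE').tolist())       # [True, False, True, False]
print(s.sort_values().tolist())   # ['EN', 'MI', 'SE', 'EX']
print(s.min(), s.max())           # EN EX
```

With unordered categories, these comparisons would raise a TypeError and sorting would fall back to alphabetical order.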
In [11]:
nominal_cols = [
    'employment_type', 'salary_currency', 'industry',
    'company_location', 'employee_residence', 'company_name'
]

for col in nominal_cols:
    df[col] = df[col].astype('category')
    # Since we do not specify 'ordered=True', Pandas treats these as unordered categories

This is how it looks now in info:

In [12]:
info = info_df(df)
info
Out[12]:
Column Non-Null Null Python Type Num. Unique Values Sample Values
0 job_id 15000 0 object 15000 [AI00001, AI00002, AI00003, AI00004, AI00005]
1 job_title 15000 0 object 20 [AI Research Scientist, AI Software Engineer, ...
2 salary_usd 15000 0 int64 14315 [90376, 61895, 152626, 80215, 54624]
3 salary_currency 15000 0 category 3 ['USD', 'EUR', 'GBP'] Categories (3, object): ...
4 experience_level 15000 0 category 4 ['SE', 'EN', 'MI', 'EX'] Categories (4, object...
5 employment_type 15000 0 category 4 ['CT', 'FL', 'PT', 'FT'] Categories (4, object...
6 company_location 15000 0 category 20 ['China', 'Canada', 'Switzerland', 'India', 'F...
7 company_size 15000 0 category 3 ['M', 'L', 'S'] Categories (3, object): ['S' <...
8 employee_residence 15000 0 category 20 ['China', 'Ireland', 'South Korea', 'India', '...
9 remote_ratio 15000 0 int64 3 [50, 100, 0]
10 required_skills 15000 0 object 13663 [Tableau, PyTorch, Kubernetes, Linux, NLP, Dee...
11 education_required 15000 0 category 4 ['Bachelor', 'Master', 'Associate', 'PhD'] Cat...
12 years_experience 15000 0 int64 20 [9, 1, 2, 7, 0]
13 industry 15000 0 category 15 ['Automotive', 'Media', 'Education', 'Consulti...
14 posting_date 15000 0 datetime64[ns] 486 [2024-10-18 00:00:00, 2024-11-20 00:00:00, 202...
15 application_deadline 15000 0 datetime64[ns] 543 [2024-11-07 00:00:00, 2025-01-11 00:00:00, 202...
16 job_description_length 15000 0 int64 2000 [1076, 1268, 1974, 1345, 1989]
17 benefits_score 15000 0 float64 51 [5.9, 5.2, 9.4, 8.6, 6.6]
18 company_name 15000 0 category 16 ['Smart Analytics', 'TechCorp Inc', 'Autonomou...

If the number of distinct values is not too large, we can display each value and its frequency to examine the variables in more detail:

In [13]:
for col in df.columns:
    if df[col].nunique() < 45:
        print(df[col].value_counts())
        print("=" * 40)
job_title
Machine Learning Researcher    808
AI Software Engineer           784
Autonomous Systems Engineer    777
Machine Learning Engineer      772
AI Architect                   771
Head of AI                     765
NLP Engineer                   762
Robotics Engineer              759
Data Analyst                   759
AI Research Scientist          756
Data Engineer                  749
AI Product Manager             743
Research Scientist             742
Principal Data Scientist       734
AI Specialist                  728
ML Ops Engineer                725
Computer Vision Engineer       724
Data Scientist                 720
Deep Learning Engineer         718
AI Consultant                  704
Name: count, dtype: int64
========================================
salary_currency
USD    11957
EUR     2314
GBP      729
Name: count, dtype: int64
========================================
experience_level
MI    3781
EX    3760
SE    3741
EN    3718
Name: count, dtype: int64
========================================
employment_type
FT    3812
FL    3758
CT    3721
PT    3709
Name: count, dtype: int64
========================================
company_location
Germany           814
Denmark           778
Canada            769
France            769
Austria           765
Singapore         764
China             763
India             754
Sweden            752
Israel            751
Ireland           750
Switzerland       746
Finland           733
Japan             733
Australia         732
Netherlands       731
United Kingdom    729
United States     724
South Korea       722
Norway            721
Name: count, dtype: int64
========================================
company_size
S    5007
L    4998
M    4995
Name: count, dtype: int64
========================================
employee_residence
Sweden            790
France            781
Denmark           777
Austria           776
India             772
Germany           769
South Korea       763
Canada            762
China             761
Netherlands       758
United Kingdom    750
Switzerland       748
Ireland           740
Singapore         740
Israel            731
Australia         730
Norway            726
United States     716
Finland           710
Japan             700
Name: count, dtype: int64
========================================
remote_ratio
0      5075
50     5005
100    4920
Name: count, dtype: int64
========================================
education_required
Bachelor     3789
Associate    3785
Master       3748
PhD          3678
Name: count, dtype: int64
========================================
years_experience
0     1890
1     1828
4     1295
3     1247
2     1239
7      769
5      757
6      753
9      742
8      720
16     403
15     392
13     391
10     384
19     378
11     373
14     364
17     363
12     362
18     350
Name: count, dtype: int64
========================================
industry
Retail                1063
Media                 1045
Automotive            1020
Consulting            1020
Technology            1011
Real Estate           1007
Government             998
Healthcare             997
Telecommunications     997
Transportation         997
Finance                984
Energy                 976
Gaming                 967
Manufacturing          962
Education              956
Name: count, dtype: int64
========================================
company_name
TechCorp Inc                  980
Cognitive Computing           972
AI Innovations                964
Digital Transformation LLC    961
Future Systems                960
Quantum Computing Inc         960
Cloud AI Solutions            951
Predictive Systems            947
Smart Analytics               927
Advanced Robotics             925
Machine Intelligence Group    922
Neural Networks Co            922
Autonomous Tech               918
DataVision Ltd                909
DeepTech Ventures             897
Algorithmic Solutions         885
Name: count, dtype: int64
========================================

Overall, the value frequencies do not show large differences.

In [14]:
df.duplicated().sum()   # no duplicates
Out[14]:
0

Next, we compute the mode of each column and how dominant it is (its frequency as a share of non-null values):

In [15]:
# Show only the first (most frequent) value in each column, along with its count
header = f"{'Column':<25} {'Mode':<30} {'Frequency':<15} {'% of non-nulls':<15}"
print(header)
print("-" * len(header))

for col in df.columns:
    frequencies = df[col].value_counts()  # value_counts() already ignores nulls
    mode_value = frequencies.index[0]     # most frequent value
    mode_count = frequencies.iloc[0]      # its frequency
    non_null_count = df[col].notnull().sum()
    mode_percentage = (mode_count / non_null_count * 100) if non_null_count > 0 else 0
    print(f"{col:<25} {str(mode_value):<30} {mode_count:<15} {round(mode_percentage, 2):<15}")
Column                    Mode                           Frequency       % of non-nulls 
----------------------------------------------------------------------------------------
job_id                    AI00001                        1               0.01           
job_title                 Machine Learning Researcher    808             5.39           
salary_usd                67253                          4               0.03           
salary_currency           USD                            11957           79.71          
experience_level          MI                             3781            25.21          
employment_type           FT                             3812            25.41          
company_location          Germany                        814             5.43           
company_size              S                              5007            33.38          
employee_residence        Sweden                         790             5.27           
remote_ratio              0                              5075            33.83          
required_skills           Python, TensorFlow, PyTorch    17              0.11           
education_required        Bachelor                       3789            25.26          
years_experience          0                              1890            12.6           
industry                  Retail                         1063            7.09           
posting_date              2024-07-05 00:00:00            51              0.34           
application_deadline      2025-01-05 00:00:00            47              0.31           
job_description_length    1519                           19              0.13           
benefits_score            9.9                            338             2.25           
company_name              TechCorp Inc                   980             6.53           

Consistent with what we have seen so far.

In [16]:
# Code to get numeric columns
numeric_columns = df.select_dtypes(include=['number']).columns.tolist()

header = f"{'Column':<30} {'Mean':<10} {'Median':<10}"
print(header)
print("-" * len(header))
for col in numeric_columns:
    mean = df[col].mean()
    median = df[col].median()
    print(f"{col:<30} {round(mean,2):<10} {round(median,2):<10}")
Column                         Mean       Median    
----------------------------------------------------
salary_usd                     115348.97  99705.0   
remote_ratio                   49.48      50.0      
years_experience               6.25       5.0       
job_description_length         1503.31    1512.0    
benefits_score                 7.5        7.5       

When the mean is very different from the median (as with salary_usd), the distribution is strongly skewed: the mean is pulled toward the long tail. Because the mean is so sensitive to extreme values, the median is the preferred (more robust) measure of central tendency here. In our case, mean > median lets us infer the presence of a tail of very high salaries.
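The mean-median gap can also be quantified with the sample skewness. A minimal sketch on synthetic right-skewed data (a lognormal stand-in for salary_usd, so the snippet stays self-contained):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Lognormal draws as a stand-in for a right-skewed salary distribution
salaries = pd.Series(rng.lognormal(mean=11.5, sigma=0.5, size=15_000))

print(f"mean   : {salaries.mean():,.0f}")
print(f"median : {salaries.median():,.0f}")
print(f"skew   : {salaries.skew():.2f}")  # positive => long right (upper) tail
```

On such data, mean > median and the skewness coefficient is clearly positive, just as we observe for salary_usd.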

To analyze our variables of interest in more detail, we will plot them next.

2.2 Visualizing Variable Distributions¶

We will plot our target variable salary_usd to see its distribution. We will also visualize how salary behaves with respect to our variables of interest (experience_level and company_size), which will be our candidate predictors.

2.2.1 Target Variable¶

For the histogram of salary_usd (our target variable), we will take one additional step to compute the optimal number of bins.

We choose the Freedman–Diaconis rule to determine the optimal number of bins in the histogram because our data shows significant positive skewness. This rule uses the interquartile range (IQR), which makes it robust to extreme values and suitable for real-world, non-normal distributions.

It also adapts dynamically to the sample size and avoids selecting an arbitrary number of bins, providing a more faithful visualization of the distribution.

We can do it automatically with bins="auto" (first alternative in the next code cell):

  • With this setting, two candidate bin counts are computed:

    • Sturges → suited to near-normal data
    • Freedman–Diaconis → suited to skewed distributions

    Matplotlib then uses whichever rule proposes more bins.

In our case, the distribution is positively skewed; the IQR-based Freedman–Diaconis rule captures that spread and proposes many more bins than Sturges, so "auto" resolves to Freedman–Diaconis.
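This behavior can be checked directly with NumPy's bin-edge estimator. A sketch on synthetic right-skewed data (a stand-in, since the snippet must stand alone):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.lognormal(mean=11.5, sigma=0.5, size=15_000)  # right-skewed stand-in

# Number of bins each rule proposes (number of edges minus one)
sturges = len(np.histogram_bin_edges(data, bins='sturges')) - 1
fd = len(np.histogram_bin_edges(data, bins='fd')) - 1

print(f"Sturges: {sturges} bins, Freedman-Diaconis: {fd} bins")
# On skewed data FD proposes far more bins, so bins='auto' resolves to FD here
```

The same comparison on df['salary_usd'] shows why the automatic and manual histograms below agree.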

Or we can do it manually (second code cell).

In [17]:
# Automatic alternative using bins='auto'
# 'auto' usually uses Freedman–Diaconis when the dataset is large (we have ~15,000 rows)
sns.histplot(df['salary_usd'], bins='auto', kde=True)
plt.title('Salary distribution')
plt.xlabel('Salary (USD)')
plt.show()
[Figure: histogram of salary_usd (bins='auto') with KDE overlay]
In [18]:
# Manual alternative using Freedman–Diaconis
# Compute the interquartile range (IQR)
q1 = df['salary_usd'].quantile(0.25)
q3 = df['salary_usd'].quantile(0.75)
iqr = q3 - q1

# Compute bin width (Freedman–Diaconis rule)
# h = 2 * IQR * n^(-1/3)
data_len = len(df['salary_usd'])
bin_width = 2 * iqr * (data_len ** (-1/3))

# Compute the number of bins
data_range = df['salary_usd'].max() - df['salary_usd'].min()
num_bins = int(data_range / bin_width)

print(f"Optimal number of bins (Freedman–Diaconis): {num_bins}")

# Plot using the computed number of bins
sns.histplot(df['salary_usd'], bins=num_bins, kde=True)

plt.title('Salary distribution')
plt.xlabel('Salary (USD)')
plt.show()
Optimal number of bins (Freedman–Diaconis): 59
[Figure: histogram of salary_usd with 59 Freedman–Diaconis bins and KDE overlay]

We can see it is heavily right-skewed (a long tail of high salaries). Most salaries cluster in the low/medium range, but there is a long tail of very high salaries. This matters for the statistical tests.

Now let's look at the boxplot:

In [19]:
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['salary_usd'], color='lightgray')
plt.title('Salary distribution (USD)', fontsize=14)
plt.xlabel('Salary (USD)')
plt.show()
[Figure: boxplot of salary_usd]

The boxplot shows that the salary distribution is clearly right-skewed. The median is around 100,000 USD.

Most salaries fall between the first and third quartiles, but there is a long tail with many very high values, shown as outliers. This confirms that salary_usd has high variability and extreme salaries.

This suggests we should treat outliers and/or consider a logarithmic scale in our regression model.
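The log-scale suggestion can be previewed numerically: taking the log of a right-skewed variable makes it far more symmetric. A sketch on a lognormal stand-in (synthetic data, so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Right-skewed stand-in for the salary distribution
salaries = pd.Series(rng.lognormal(mean=11.5, sigma=0.5, size=15_000))

log_salaries = np.log(salaries)  # the log compresses the long right tail

print(f"skew before log: {salaries.skew():.2f}")
print(f"skew after log : {log_salaries.skew():.2f}")
```

This is the intuition behind the linear vs. logarithmic model comparison in section 6.2.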

2.2.2 Predictor Variables¶

Now we will visualize each predictor individually, and then its relationship with salary.

In [20]:
# --- ANALYSIS OF PREDICTOR VARIABLE: experience_level ---

plt.figure(figsize=(10, 5))

# Univariate plot
el_count = df['experience_level'].value_counts()

sns.barplot(x=el_count.index, y=el_count.values)
plt.title('Frequency by experience level')
plt.xlabel('Experience level')
plt.ylabel('Employees')

ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

for i, v in enumerate(el_count.values):
    plt.text(i, v + 10, str(v), ha='center', va='bottom')

plt.show()
[Figure: bar chart of frequencies by experience level]

The frequencies of the four experience levels are very similar: the classes are almost perfectly balanced.

In [21]:
# Bivariate plot (X vs Y)
plt.figure(figsize=(10, 6))
sns.boxplot(x='experience_level', y='salary_usd', data=df)
plt.title('Salary distribution by experience level', fontsize=14)
plt.xlabel('Experience level')
plt.ylabel('Salary (USD)')
plt.show()
[Figure: boxplots of salary_usd by experience level]

The boxplot confirms our suspicion: as experience increases, the median salary increases.
We also clearly see points above the whiskers—those are the outliers we had already noticed.

So the boxplot helps us explain how the salary median (the central line) increases with experience level, and it also sets up the next step about outliers.
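A numeric companion to this boxplot is the median salary per level, e.g. df.groupby('experience_level')['salary_usd'].median(). A self-contained sketch on a toy frame with the same level codes (synthetic salaries that drift upward with experience):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
levels = ['EN', 'MI', 'SE', 'EX']

# Toy frame standing in for df: salaries drift upward with experience
toy = pd.DataFrame({'experience_level': rng.choice(levels, size=1_000)})
base = {'EN': 60_000, 'MI': 90_000, 'SE': 130_000, 'EX': 180_000}
toy['salary_usd'] = toy['experience_level'].map(base) + rng.normal(0, 10_000, size=1_000)

# The numeric counterpart of the boxplot's central lines
medians = toy.groupby('experience_level')['salary_usd'].median()
print(medians.reindex(levels).round(0))
```

Running the same groupby on the real df gives the exact medians behind the boxes above.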

In [22]:
# --- ANALYSIS OF PREDICTOR VARIABLE: company_size ---

plt.figure(figsize=(10, 5))

# Univariate plot
cs_count = df['company_size'].value_counts()

sns.barplot(x=cs_count.index, y=cs_count.values)
plt.title('Frequency by company size')
plt.xlabel('Company size')
plt.ylabel('Employees')

ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

for i, v in enumerate(cs_count.values):
    plt.text(i, v + 10, str(v), ha='center', va='bottom')

plt.show()
[Figure: bar chart of frequencies by company size]

The frequencies of the three company sizes are also very similar: the classes are almost perfectly balanced.

In [23]:
# Bivariate plot (X vs Y)
plt.figure(figsize=(10, 6))
sns.boxplot(x='company_size', y='salary_usd', data=df)
plt.title('Salary distribution by company size', fontsize=14)
plt.xlabel('Company size')
plt.ylabel('Salary (USD)')
plt.show()
[Figure: boxplots of salary_usd by company size]

The boxplot suggests that the larger the company, the higher the median salary. We also again clearly see points above the whiskers (outliers).

So the boxplot helps us explain how the median salary increases with company size.

3. Outliers¶


Salaries often contain extreme values that can distort linear regression, which is sensitive to them.

In the previous section we identified outliers visually in our target variable. Now we will apply a numerical method to robustly locate outliers, and then decide and justify how we will treat them.

3.1 Identification¶

Following the same logic as the boxplots, but in a numerical way, we will apply the Interquartile Range (IQR) method, also known as Tukey’s Fences, since for our case (a tabular problem) it is an appropriate method to identify outliers in our target variable.

In [24]:
threshold = 1.5
Q1 = df['salary_usd'].quantile(0.25)
Q3 = df['salary_usd'].quantile(0.75)
IQR = Q3 - Q1

floor = Q1 - threshold * IQR
ceiling = Q3 + threshold * IQR

print(f"Interquartile Range (IQR): {IQR:,.0f}")
print(f"Tukey's fences: < {floor:,.0f} and > {ceiling:,.0f}")
Interquartile Range (IQR): 76,229
Tukey's fences: < -44,163 and > 260,752

3.2 Treatment¶

For data cleaning, we chose Tukey's statistical criterion (1.5·IQR) instead of setting an arbitrary or fixed threshold (such as the 99th percentile). This method is dynamic and allows us to detect salaries that deviate significantly from the central market distribution, ensuring that our regression model is trained on data representative of the average professional and is not biased by exceptional salaries.

In other words, while this excludes a real niche of high executives (C-suite > $260k), we prioritize the model’s generalization capability. This way, we prevent that corporate elite from distorting predictions for the vast majority of professionals in the sector.
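For reference, a common alternative to dropping outliers is winsorizing (capping values at the fences), which preserves sample size at the cost of distorting the tail. A minimal sketch with made-up numbers, not the treatment applied in this notebook:

```python
import pandas as pd

# Hypothetical salaries with one extreme value (made-up numbers)
s = pd.Series([60_000, 90_000, 120_000, 150_000, 900_000])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
floor, ceiling = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Winsorize: cap values beyond Tukey's fences instead of removing them
s_capped = s.clip(lower=floor, upper=ceiling)
print(s_capped.tolist())
```

We preferred removal here because a capped extreme salary would still pull the regression line, just less so; winsorizing is mainly useful when sample size is scarce.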

In [25]:
# Identify how many points fall outside the fences
outliers = df[(df['salary_usd'] < floor) | (df['salary_usd'] > ceiling)]
print(f"Detected {len(outliers)} outliers ({len(outliers)/len(df):.2%} of the dataset).")

# Decision: filter them out
df_clean = df.drop(outliers.index).copy()

print("Treatment applied: removed records outside Tukey's fences.")

# Visualization after outlier treatment
plt.figure(figsize=(10, 4))
sns.boxplot(x=df_clean['salary_usd'])
plt.title('Distribution after removing outliers')
plt.show()

df_clean
Detected 483 outliers (3.22% of the dataset).
Treatment applied: removed records outside Tukey's fences.
[Figure: boxplot of salary distribution after removing outliers]
Out[25]:
job_id job_title salary_usd salary_currency experience_level employment_type company_location company_size employee_residence remote_ratio required_skills education_required years_experience industry posting_date application_deadline job_description_length benefits_score company_name
0 AI00001 AI Research Scientist 90376 USD SE CT China M China 50 Tableau, PyTorch, Kubernetes, Linux, NLP Bachelor 9 Automotive 2024-10-18 2024-11-07 1076 5.9 Smart Analytics
1 AI00002 AI Software Engineer 61895 USD EN CT Canada M Ireland 100 Deep Learning, AWS, Mathematics, Python, Docker Master 1 Media 2024-11-20 2025-01-11 1268 5.2 TechCorp Inc
2 AI00003 AI Specialist 152626 USD MI FL Switzerland L South Korea 0 Kubernetes, Deep Learning, Java, Hadoop, NLP Associate 2 Education 2025-03-18 2025-04-07 1974 9.4 Autonomous Tech
3 AI00004 NLP Engineer 80215 USD SE FL India M India 50 Scala, SQL, Linux, Python PhD 7 Consulting 2024-12-23 2025-02-24 1345 8.6 Future Systems
4 AI00005 AI Consultant 54624 EUR EN PT France S Singapore 100 MLOps, Java, Tableau, Python Master 0 Media 2025-04-15 2025-06-23 1989 6.6 Advanced Robotics
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14995 AI14996 Robotics Engineer 38604 USD EN FL Finland S Finland 50 Java, Kubernetes, Azure Bachelor 1 Energy 2025-02-06 2025-03-25 1635 7.9 Advanced Robotics
14996 AI14997 Machine Learning Researcher 57811 GBP EN CT United Kingdom M United Kingdom 0 Mathematics, Docker, SQL, Deep Learning Master 0 Government 2024-10-16 2024-10-30 1624 8.2 Smart Analytics
14997 AI14998 NLP Engineer 189490 USD EX CT South Korea L South Korea 50 Scala, Spark, NLP Associate 17 Manufacturing 2024-03-19 2024-05-02 1336 7.4 AI Innovations
14998 AI14999 Head of AI 79461 EUR EN FT Netherlands M Netherlands 0 Java, Computer Vision, Python, TensorFlow PhD 1 Real Estate 2024-03-22 2024-04-23 1935 5.6 Smart Analytics
14999 AI15000 Computer Vision Engineer 56481 USD MI PT Austria S Austria 50 Scala, Azure, Deep Learning, GCP, Mathematics PhD 2 Technology 2024-07-18 2024-08-10 2492 7.6 AI Innovations

14517 rows × 19 columns

4. Statistical Inference¶


Before running regression, we should scientifically validate whether the differences we see in the plots are statistically significant.

In particular, before moving to predictive modeling, we assess whether observed salary differences across groups are not only statistically significant, but also practically meaningful. This distinction is crucial: large datasets can produce very small p-values even when real-world effects are modest.

So, to determine which variables truly influence salary and justify their inclusion in the future regression model, we will run a set of statistical tests.

4.1 Normality Assessment¶

The first step is to verify the distribution of the dependent variable (salary_usd). Since our dataset is large (N > 5000), we discard the Shapiro–Wilk test (too sensitive for large samples) and instead use the D’Agostino–Pearson test, complemented visually with Q–Q plots.

Our validation strategy has two levels:

  1. Global level: run a test on the full salary_usd variable to confirm statistically the skewness detected during EDA.
  2. Group level: check normality within each subgroup (e.g., salary distribution only for Seniors).
    • Decision: if we fail at this critical level, we will discard parametric methods and use robust alternatives such as Kruskal–Wallis.
In [26]:
# GLOBAL ANALYSIS
print("GLOBAL NORMALITY TEST (Variable: salary_usd)")
stat, p_value = stats.normaltest(df_clean['salary_usd'])

print(f"   Statistic: {stat:.4f}, p-value: {p_value:.4e}")
if p_value < 0.05:
    print("   ✅ CONCLUSION: We reject global normality (consistent with the skewness seen in EDA).")
else:
    print("   ❌ CONCLUSION: We cannot reject global normality.")
GLOBAL NORMALITY TEST (Variable: salary_usd)
   Statistic: 1287.3386, p-value: 2.8706e-280
   ✅ CONCLUSION: We reject global normality (consistent with the skewness seen in EDA).
In [27]:
# GROUP-BASED ANALYSIS
def check_normality_groups(df, group_col, target_col):
    # Use .cat.categories to respect our hierarchical ordering
    unique_groups = df[group_col].cat.categories

    # Plot configuration
    # Dynamically adjust the figure size based on the number of groups
    fig, axes = plt.subplots(1, len(unique_groups), figsize=(5 * len(unique_groups), 4))

    print(f"\nNORMALITY TEST BY GROUPS: {group_col}")

    for i, group in enumerate(unique_groups):
        # Filter data for the specific group
        data = df[df[group_col] == group][target_col]

        # A. Statistical test (D'Agostino)
        if len(data) >= 8:  # Minimum required for the test; otherwise it would not be valid
            stat, p_value = stats.normaltest(data)
            normality = "FAIL TO REJECT NORMALITY" if p_value > 0.05 else "NOT NORMAL"
            print(f"   -> Group '{group}' (n={len(data)}): p-value={p_value:.4e} => {normality}")
        else:
            print(f"   -> Group '{group}': Not enough data for a reliable test.")

        # B. Q-Q Plot
        ax = axes[i] if len(unique_groups) > 1 else axes
        sm.qqplot(data, line='s', ax=ax, markerfacecolor='blue', markeredgecolor='blue', alpha=0.3)
        ax.set_title(f'Q-Q Plot: {group}')

    plt.tight_layout()
    plt.show()


# Run for Experience Level
check_normality_groups(df_clean, 'experience_level', 'salary_usd')

# Run for Company Size
check_normality_groups(df_clean, 'company_size', 'salary_usd')
NORMALITY TEST BY GROUPS: experience_level
   -> Group 'EN' (n=3718): p-value=5.1094e-60 => NOT NORMAL
   -> Group 'MI' (n=3781): p-value=1.5729e-54 => NOT NORMAL
   -> Group 'SE' (n=3741): p-value=5.2348e-54 => NOT NORMAL
   -> Group 'EX' (n=3277): p-value=1.0541e-72 => NOT NORMAL
[Figure: Q–Q plots of salary_usd by experience level (EN, MI, SE, EX)]
NORMALITY TEST BY GROUPS: company_size
   -> Group 'S' (n=4941): p-value=1.1475e-126 => NOT NORMAL
   -> Group 'M' (n=4855): p-value=2.7140e-108 => NOT NORMAL
   -> Group 'L' (n=4721): p-value=2.5776e-71 => NOT NORMAL
[Figure: Q–Q plots of salary_usd by company size (S, M, L)]

Interpretation: why “non-normal” and what does it imply?¶

Normality tests (such as D’Agostino) detect deviations from normality. In large datasets (like this one), these tests are very sensitive: even small deviations can produce extremely low p-values.

That’s why the correct reading is not only “it is not normal”, but:

  • Consistency with EDA: the salary distribution shows positive skewness (right tail), consistent with very high salaries in a subset.
  • Implication for inference: since normality is not met (and homoscedasticity also fails), we prioritize robust / non-parametric methods to compare groups (Kruskal–Wallis) and quantify differences (bootstrapping).
  • Implication for regression: in OLS, what matters is the behavior of the residuals (linearity, constant variance). That is why later we evaluate residuals and apply a log-salary transformation to stabilize variance.

In short: non-normality does not invalidate the analysis, but it conditions the choice of methods and makes our methodological decisions easier to trace.
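As a quick illustration of why a log transform helps with this kind of right skew, here is a self-contained sketch on synthetic lognormal "salaries" (not the actual dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic right-skewed "salaries" (lognormal), mimicking the skewness seen in EDA
salaries = rng.lognormal(mean=11.3, sigma=0.5, size=5_000)

skew_raw = stats.skew(salaries)
skew_log = stats.skew(np.log(salaries))

# The log transform removes most of the right-tail skew
print(f"Skewness raw: {skew_raw:.2f} | after log: {skew_log:.2f}")
```

This is the same intuition we apply later when comparing the linear model against the log-salary model.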

4.2 Homoscedasticity Assessment (Equal Variances)¶

Since we have already seen that the data are not normally distributed, Bartlett’s test is not reliable. Instead, we apply Levene’s test, which is more robust because it does not assume normality.

In [28]:
def check_homoscedasticity(df, group_col, target_col):
    # Prepare groups
    groups = [df[df[group_col] == g][target_col] for g in df[group_col].unique()]

    # Levene test (center='median' is more robust for skewed data)
    stat, p_value = stats.levene(*groups, center='median')

    print(f"--- Levene test ({group_col}) ---")
    print(f"W statistic: {stat:.4f}")
    print(f"p-value: {p_value:.4e}")

    if p_value < 0.05:
        print("✅ CONCLUSION: Variances are significantly different (heteroscedasticity).")
    else:
        print("❌ CONCLUSION: No evidence that variances differ.")

# Check for experience and company size
check_homoscedasticity(df_clean, 'experience_level', 'salary_usd')
check_homoscedasticity(df_clean, 'company_size', 'salary_usd')
--- Levene test (experience_level) ---
W statistic: 835.7112
p-value: 0.0000e+00
✅ CONCLUSION: Variances are significantly different (heteroscedasticity).
--- Levene test (company_size) ---
W statistic: 16.2998
p-value: 8.4926e-08
✅ CONCLUSION: Variances are significantly different (heteroscedasticity).

Interpretation: what does it mean if variances are different?¶

When Levene’s test rejects equality of variances, it tells us that salary dispersion is not the same across groups. This adds nuance beyond “who earns more”:

  • Which group varies more? A higher variance (or standard deviation / IQR) indicates a wider salary band: more heterogeneity in compensation within that segment (e.g., negotiation, specialization, location, bonuses).
  • What does it imply for inference? With heteroscedasticity, parametric tests that assume equal variances (e.g., classic ANOVA) are less appropriate; robust or non-parametric alternatives become preferable (Kruskal–Wallis).
  • What does it imply for modeling? In linear regression, heteroscedasticity breaks one of the Gauss–Markov assumptions (constant error variance), affecting the validity of standard errors and confidence intervals. That is why later we use a log transformation and residual diagnostics.

To make this concrete, we quantify dispersion by group below (variance, standard deviation, IQR, CV) to identify which category varies the most and what insight that adds to the analysis.

In [29]:
# Quantify dispersion by group to understand "who varies more" (not only who earns more)

def dispersion_table(df, group_col, target_col="salary_usd", observed=True):
    """Return a dispersion table by group (variance, std, IQR, CV)."""
    g = df.groupby(group_col, observed=observed)[target_col]
    summary = g.agg(
        n="count",
        mean="mean",
        median="median",
        std="std",
        var="var",
        q1=lambda x: x.quantile(0.25),
        q3=lambda x: x.quantile(0.75),
    )
    summary["iqr"] = summary["q3"] - summary["q1"]
    summary["cv"] = summary["std"] / summary["mean"]
    return summary.sort_values("std", ascending=False)

for col in ["experience_level", "company_size"]:
    disp = dispersion_table(df_clean, col, "salary_usd", observed=True)
    display(disp)

    g_max = disp["std"].idxmax()
    g_min = disp["std"].idxmin()
    std_ratio = disp.loc[g_max, "std"] / disp.loc[g_min, "std"]

    print(
        f"➡️ Highest dispersion (STD) in '{col}': {g_max} "
        f"(std={disp.loc[g_max,'std']:,.0f} | IQR={disp.loc[g_max,'iqr']:,.0f} | CV={disp.loc[g_max,'cv']:.2f})"
    )
    print(
        f"➡️ Lowest dispersion (STD) in '{col}': {g_min} "
        f"(std={disp.loc[g_min,'std']:,.0f} | IQR={disp.loc[g_min,'iqr']:,.0f} | CV={disp.loc[g_min,'cv']:.2f})"
    )
    print(f"📈 Dispersion ratio (std max / std min): {std_ratio:.2f}")

    # Quick visualization: standard deviation by group
    plt.figure(figsize=(8, 3))
    disp["std"].sort_values().plot(kind="barh")
    plt.title(f"Salary standard deviation by group: {col}")
    plt.xlabel("STD(salary_usd)")
    plt.ylabel(col)
    plt.show()
n mean median std var q1 q3 iqr cv
experience_level
EX 3277 171357.764724 167899.0 42645.767697 1.818662e+09 138802.0 203278.0 64476.0 0.248870
SE 3741 122187.657845 116907.0 35261.617316 1.243382e+09 94173.0 145315.0 51142.0 0.288586
MI 3781 87955.471833 84641.0 25543.109400 6.524504e+08 67621.0 104483.0 36862.0 0.290410
EN 3718 63133.377084 60373.5 18558.096806 3.444030e+08 48516.0 74866.5 26350.5 0.293951
➡️ Highest dispersion (STD) in 'experience_level': EX (std=42,646 | IQR=64,476 | CV=0.25)
➡️ Lowest dispersion (STD) in 'experience_level': EN (std=18,558 | IQR=26,350 | CV=0.29)
📈 Dispersion ratio (std max / std min): 2.30
[Figure: bar chart of salary standard deviation by experience level]
n mean median std var q1 q3 iqr cv
company_size
L 4721 120178.613429 111160.0 51196.951526 2.621128e+09 80483.0 155554.0 75071.0 0.426007
M 4855 108280.711843 96759.0 50461.054640 2.546318e+09 70841.5 139181.0 68339.5 0.466021
S 4941 99750.428658 88895.0 48242.994196 2.327386e+09 63454.0 128067.0 64613.0 0.483637
➡️ Highest dispersion (STD) in 'company_size': L (std=51,197 | IQR=75,071 | CV=0.43)
➡️ Lowest dispersion (STD) in 'company_size': S (std=48,243 | IQR=64,613 | CV=0.48)
📈 Dispersion ratio (std max / std min): 1.06
[Figure: bar chart of salary standard deviation by company size]

Interpreting the dispersion table: who varies more and why does it matter?¶

Once we compute variance / standard deviation / IQR / coefficient of variation (CV) by group, we can extract conclusions such as:

  • Highest dispersion: the group with the largest STD/IQR has the widest salary range, which typically reflects stronger heterogeneity (mixed roles, broader seniority within the same label, bonuses, etc.).
  • Lowest dispersion: the group with the smallest dispersion has more standardized pay bands (more homogeneous positions and clearer market pricing).

Therefore, higher dispersion within certain groups suggests heterogeneous career paths and compensation structures, possibly reflecting differences in seniority, specialization, negotiation power, or firm-level policies rather than a single dominant salary trajectory.

This is important because two groups can have different medians and very different uncertainty around those medians. In practice, that translates to wider vs. narrower salary “bands” for negotiation and benchmarking.

4.3 Overall Analysis: Kruskal–Wallis¶

Assumptions check (parametric requirements)¶

Before choosing a global test, we verified the main ANOVA assumptions:

  1. Normality: rejected (p < 0.05 in D’Agostino’s test).
  2. Homoscedasticity (equal variances): rejected (p < 0.05 in Levene’s test).

Because these parametric assumptions are violated, and because we need to compare more than two independent groups, we discard the one-way ANOVA.

Theoretical note: The Mann–Whitney U test is only appropriate for comparing two independent groups.
When the goal is to compare more than two groups in a non-parametric and robust way, we use the Kruskal–Wallis test, which generalizes Mann–Whitney to the multi-group setting.

Why Kruskal–Wallis?¶

Given the non-normality and heteroscedasticity, we adopt the Kruskal–Wallis test as the non-parametric alternative to ANOVA.
It evaluates whether the salary distributions differ across several groups (e.g., experience levels). We apply it to each candidate predictor to confirm whether it has a global effect on salary.

Hypotheses (Kruskal–Wallis)¶

  • H₀: The salary distributions are the same across all groups (often interpreted as “equal medians”).
  • H₁: At least one group differs (i.e., at least one distribution is systematically shifted relative to another).

Technical note (interpretation): Although Kruskal–Wallis is commonly described as a comparison of medians, strictly speaking it tests for stochastic dominance / distributional differences.
If group distributions have different shapes (which is plausible given the variance differences suggested by Levene’s test), a significant result means that for at least one pair of groups the probability that a randomly chosen observation from one group exceeds a randomly chosen observation from another is systematically different from 0.5.
This interpretation is more general and robust than “median differences” alone.
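This "probability of exceeding 0.5" reading can be estimated directly: dividing the Mann–Whitney U statistic by n₁·n₂ gives the probability of superiority. A self-contained sketch on synthetic groups (made-up numbers):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two synthetic salary groups; group_b is shifted upward (made-up numbers)
group_a = rng.normal(loc=90_000, scale=20_000, size=500)
group_b = rng.normal(loc=110_000, scale=25_000, size=500)

u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# U / (n1 * n2) estimates P(random A-salary > random B-salary)
prob_superiority = u_stat / (len(group_a) * len(group_b))
print(f"P(A > B) ≈ {prob_superiority:.3f} | p-value = {p_value:.2e}")
```

A value far from 0.5 (here well below it) is exactly the kind of systematic shift that a significant Kruskal–Wallis result points to.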

In [30]:
def test_kruskal_global(df, group_col, target_col):
    groups = [df[df[group_col] == val][target_col] for val in df[group_col].unique()]
    stat, p_value = stats.kruskal(*groups)

    print(f"Kruskal–Wallis test for variable: {group_col}")
    print(f"   H statistic: {stat:.4f} | p-value: {p_value:.4e}")

    if p_value < 0.05:
        print(f"   ✅ H0 rejected: '{group_col}' significantly affects salary.")
    else:
        print(f"   ❌ H0 not rejected: we cannot claim significant differences.")
    print("-" * 98)

# Apply to our two candidate predictors
test_kruskal_global(df_clean, 'experience_level', 'salary_usd')
test_kruskal_global(df_clean, 'company_size', 'salary_usd')
Kruskal–Wallis test for variable: experience_level
   H statistic: 9368.7891 | p-value: 0.0000e+00
   ✅ H0 rejected: 'experience_level' significantly affects salary.
--------------------------------------------------------------------------------------------------
Kruskal–Wallis test for variable: company_size
   H statistic: 449.3941 | p-value: 2.6021e-98
   ✅ H0 rejected: 'company_size' significantly affects salary.
--------------------------------------------------------------------------------------------------
In [31]:
# Effect size for Kruskal–Wallis: how large is the global difference?

def kruskal_effect_size(df, group_col, target_col="salary_usd", observed=True):
    groups = [g[target_col].values for _, g in df.groupby(group_col, observed=observed)]
    H, p = stats.kruskal(*groups)
    n = sum(len(x) for x in groups)
    k = len(groups)

    # Epsilon squared (ε²) is a common effect size for Kruskal–Wallis
    epsilon_sq = (H - k + 1) / (n - k)

    # Alternative: approximate eta squared (η²) for reference
    eta_sq = H / (n - 1)

    return H, p, epsilon_sq, eta_sq, n, k

for col in ["experience_level", "company_size"]:
    H, p, eps2, eta2, n, k = kruskal_effect_size(df_clean, col, "salary_usd", observed=True)
    print(f"\nEffect size for: {col}")
    print(f"H = {H:,.2f} | p = {p:.2e} | n = {n} | k = {k}")
    print(f"ε² (epsilon squared) = {eps2:.4f}")
    print(f"η² (approx.)         = {eta2:.4f}")

    # Rule of thumb (indicative thresholds): 0.01 small, 0.06 medium, 0.14 large
    if eps2 < 0.01:
        mag = "negligible"
    elif eps2 < 0.06:
        mag = "small"
    elif eps2 < 0.14:
        mag = "medium"
    else:
        mag = "large"
    print(f"➡️ Interpretation (indicative): {mag} effect for {col}.")
Effect size for: experience_level
H = 9,368.79 | p = 0.00e+00 | n = 14517 | k = 4
ε² (epsilon squared) = 0.6453
η² (approx.)         = 0.6454
➡️ Interpretation (indicative): large effect for experience_level.

Effect size for: company_size
H = 449.39 | p = 2.60e-98 | n = 14517 | k = 3
ε² (epsilon squared) = 0.0308
η² (approx.)         = 0.0310
➡️ Interpretation (indicative): small effect for company_size.

Effect size: how large is the difference?¶

Statistical significance tells us whether differences exist, but not how large they are. Therefore, we compute an effect size for Kruskal–Wallis (e.g., $\varepsilon^2$ or an approximate $\eta^2$) to quantify the proportion of variance explained by each categorical predictor.

This answers: Is the factor merely significant, or also meaningful in magnitude?

4.4 Specific Analysis (Post-hoc): Quantifying Differences (Bootstrapping)¶

The global Kruskal–Wallis test confirms that significant differences exist across groups, but it does not tell us where those differences are or how large they are.

To deepen the analysis, we run a post-hoc study using bootstrapping. This technique allows us to build a 95% Confidence Interval (CI) for the difference in central tendency between key pairs (typically the difference in means, and optionally the difference in medians). This provides two complementary benefits:

  1. Statistical Significance:
    If the CI for the difference does not include 0, we interpret it as evidence of a real disparity between groups (conceptually equivalent to a p-value < 0.05).

  2. Economic / Business Quantification:
    Unlike a p-value, the CI yields an interpretable range in dollar terms (e.g., “they earn between X and Y more”), which is crucial for business decision-making and compensation strategy.

All confidence intervals are constructed via 95% bootstrapping.

Why bootstrapping?

  • It is non-parametric and does not rely on normality assumptions.
  • It produces an intuitive confidence interval for the difference (mean and/or median).
  • It is robust to skewness and outliers, and easy to interpret in business terms.

Note: Although bootstrapping is often used with small samples, we apply it here despite having N > 14,000 because its non-parametric nature helps produce confidence intervals that respect the true asymmetry of the salary distribution, without imposing assumptions of normality or symmetry.
Additionally, in a later cell we also report the relative (%) difference in mean and median to contextualize practical relevance (not just significance).
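For intuition, the mechanics behind `scipy.stats.bootstrap` can be sketched by hand as a simple percentile bootstrap (synthetic groups with made-up numbers; note that scipy defaults to the more refined BCa interval):

```python
import numpy as np

rng = np.random.default_rng(24)

# Two synthetic groups with a true mean gap of ~10,000 (made-up numbers)
control = rng.normal(90_000, 15_000, size=400)
test = rng.normal(100_000, 15_000, size=400)

# Resample each group independently, with replacement, and record the mean gap
n_boot = 5_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    c = rng.choice(control, size=control.size, replace=True)
    t = rng.choice(test, size=test.size, replace=True)
    diffs[i] = t.mean() - c.mean()

# Percentile 95% CI: the central 95% of the bootstrap distribution
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for the mean difference: [{ci_low:,.0f}, {ci_high:,.0f}]")
```

Because the interval stays above 0, we would read it as a significant positive difference, which is the same decision rule applied to the real data in this section.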

Questions to answer (pairs of interest): We apply this technique to the comparisons we consider most informative:

  1. Experience: How much more does a Senior (SE) profile earn compared to a Mid-Level (MI) profile?
  2. Company size: Do Medium (M) companies actually pay more than Small (S) companies?
In [32]:
def bootstrap_difference(df, group_col, control_group, test_group, target='salary_usd'):
    # Data filtering
    data_control = df[df[group_col] == control_group][target].values
    data_test = df[df[group_col] == test_group][target].values

    # Statistic function (difference in means: test - control)
    # It accepts an 'axis' argument so scipy.stats.bootstrap can evaluate it in
    # vectorized mode over the resamples, which is much faster than a Python loop
    def diff_means(x, y, axis=-1):
        return np.mean(y, axis=axis) - np.mean(x, axis=axis)

    # Bootstrap execution
    # We set 'random_state' for reproducibility
    res = bootstrap(
        (data_control, data_test),
        diff_means,
        confidence_level=0.95,
        random_state=24
    )

    ci = res.confidence_interval
    mean_diff = np.mean(data_test) - np.mean(data_control)

    print(f"\n📊 BOOTSTRAP: {group_col} ({test_group} vs {control_group})")
    print(f"   Observed Mean Difference: ${mean_diff:,.2f}")
    print(f"   Confidence Interval (95%): [${ci.low:,.2f}, ${ci.high:,.2f}]")

    if ci.low > 0:
        print(f"   ✅ CONCLUSION: Significant positive difference ({test_group} earns more).")
    elif ci.high < 0:
        print(f"   ✅ CONCLUSION: Significant negative difference ({control_group} earns more).")
    else:
        print("   ❌ CONCLUSION: Not significant (the interval includes 0).")


# Specific experience analysis (Senior vs Mid)
bootstrap_difference(df_clean, 'experience_level', 'MI', 'SE')

# Specific company size analysis (Medium vs Small)
# We compare M vs S because it represents the first growth jump from a startup
bootstrap_difference(df_clean, 'company_size', 'S', 'M')
📊 BOOTSTRAP: experience_level (SE vs MI)
   Observed Mean Difference: $34,232.19
   Confidence Interval (95%): [$32,857.99, $35,623.00]
   ✅ CONCLUSION: Significant positive difference (SE earns more).

📊 BOOTSTRAP: company_size (M vs S)
   Observed Mean Difference: $8,530.28
   Confidence Interval (95%): [$6,578.70, $10,509.18]
   ✅ CONCLUSION: Significant positive difference (M earns more).
In [33]:
# Statistical significance vs practical relevance: absolute and relative difference (%)

def pair_practical(df, group_col, group_a, group_b, target_col="salary_usd"):
    a = df.loc[df[group_col] == group_a, target_col]
    b = df.loc[df[group_col] == group_b, target_col]

    mean_a, mean_b = a.mean(), b.mean()
    med_a, med_b = a.median(), b.median()

    diff_mean = mean_a - mean_b
    pct_mean = diff_mean / mean_b * 100

    diff_med = med_a - med_b
    pct_med = diff_med / med_b * 100

    print(f"\n📌 Practical comparison: {group_col} ({group_a} vs {group_b})")
    print(f"Mean:   {group_a} = {mean_a:,.0f} | {group_b} = {mean_b:,.0f} | Δ = {diff_mean:,.0f} ({pct_mean:.1f}%)")
    print(f"Median: {group_a} = {med_a:,.0f} | {group_b} = {med_b:,.0f} | Δ = {diff_med:,.0f} ({pct_med:.1f}%)")

pair_practical(df_clean, "experience_level", "SE", "MI", "salary_usd")
pair_practical(df_clean, "company_size", "M", "S", "salary_usd")
📌 Practical comparison: experience_level (SE vs MI)
Mean:   SE = 122,188 | MI = 87,955 | Δ = 34,232 (38.9%)
Median: SE = 116,907 | MI = 84,641 | Δ = 32,266 (38.1%)

📌 Practical comparison: company_size (M vs S)
Mean:   M = 108,281 | S = 99,750 | Δ = 8,530 (8.6%)
Median: M = 96,759 | S = 88,895 | Δ = 7,864 (8.8%)

A “practical” reading of the results (beyond the confidence interval):

  • Here we do not only confirm that the difference is different from 0; we also express it as an absolute Δ (USD) and a relative Δ (percentage).
  • In a business context, this translation into impact is key: a statistically significant difference might not be relevant if it is small in relative terms—and vice versa.

Experience (SE vs MI):
The 95% confidence interval for the difference in means is above 0, confirming that Senior profiles earn more than Mid-Level profiles.

Company size (M vs S):
The 95% confidence interval is also above 0, indicating that Medium companies pay more than Small ones.

Note: In the previous cell we also report the relative (%) difference in mean and median to contextualize practical relevance (not just statistical significance).

5. Correlation and Association Analysis¶


In this analysis, the target variable (salary_usd) is numerical, while the predictors (experience_level and company_size) are ordinal categorical variables.

Following the course methodology, when working with ordinal qualitative variables against a numerical target, we approach the analysis from two complementary perspectives to ensure statistical rigor and compliance with assumptions:

  1. Trend and Direction (Spearman Correlation):

    • We convert categories into numerical ranks (encoding).
    • We use Spearman’s correlation (instead of Pearson’s) because it is more robust to non-normal data and is specifically designed to evaluate monotonic relationships (increasing or decreasing) between ranked variables. This allows us to confirm the direction of the relationship.
  2. True Effect Strength (Association Metrics):

    • Eta-squared ($\eta^2$): Used to quantify the actual effect size, measuring the percentage of salary variance explained by each variable (without assuming strict linearity). This metric complements the Kruskal–Wallis test and helps assess the relevance of each variable as a predictor.
    • Cramér’s V: Used to validate independence between predictors and rule out multicollinearity. Specifically, to assess whether experience_level and company_size are associated with each other (i.e., potential categorical multicollinearity), we apply the Chi-square test and compute Cramér’s V, which normalizes the strength of association (0 = none, 1 = perfect) between nominal or ordinal variables.

Together, these metrics allow us to statistically validate both the usefulness and non-redundancy of the variables before incorporating them into the regression model.

In [34]:
# TREND ANALYSIS (Spearman Correlation)
print("--- SPEARMAN CORRELATION MATRIX (Direction) ---")

# Prepare the data -> numeric encoding while preserving the hierarchy (Junior < Senior)
df_corr = df_clean.copy()
df_corr['exp_code'] = df_corr['experience_level'].cat.codes
df_corr['size_code'] = df_corr['company_size'].cat.codes

# Compute correlation (method='spearman' due to non-normality and ordinal nature; it is also more robust to outliers)
corr_matrix = df_corr[['salary_usd', 'exp_code', 'size_code']].corr(method='spearman')

# Visualization
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', fmt=".2f", vmin=-1, vmax=1)
plt.title("Spearman Correlation")
plt.show()
--- SPEARMAN CORRELATION MATRIX (Direction) ---
[Figure: Spearman correlation heatmap (salary_usd, exp_code, size_code)]

Interpretation of Spearman’s correlation¶

  • salary_usd ↔ exp_code (ρ = 0.80)
    There is a strong, positive monotonic correlation: as experience level increases, the salary trend increases as well. This is consistent with the test results and with the effect size we will compute next using η².

  • salary_usd ↔ size_code (ρ = 0.17)
    The relationship is weak and positive. Larger companies show a slightly higher salary trend, although the strength of the association is low (in line with η² ≈ 0.03, shown below).

  • exp_code ↔ size_code (ρ = −0.03)
    The correlation is virtually zero. Company size and experience level behave independently, with no meaningful monotonic relationship. This confirms there is no risk of categorical multicollinearity between the predictors.

In [35]:
# EFFECT SIZE AND ASSOCIATION ANALYSIS (Advanced Metrics)

# First, we define the functions we will need

def compute_eta_squared(df, cat_col, num_col):
    """
    Computes effect size (Eta-squared) using the Kruskal–Wallis H statistic.
    Interpretation: % of variance in the numerical variable explained by the categorical variable.
    """
    groups = [df[df[cat_col] == val][num_col] for val in df[cat_col].unique()]
    H, _ = stats.kruskal(*groups)
    k = len(groups)     # k = number of groups
    n = len(df)         # n = total number of observations
    eta_sq = (H - k + 1) / (n - k)
    # More detailed interpretation:
    # - Numerator (H - k + 1):
    #     adjusts the H statistic by removing the "noise" introduced by having multiple groups
    # - Denominator (n - k):
    #     normalizes the result so it becomes a proportion between 0 and 1
    # - Result:
    #     a value between 0 and 1, interpreted as the
    #     "proportion of salary variability explained by the categorical variable"
    return max(0, eta_sq)


def compute_cramers_v(df, col1, col2):
    """
    Computes Cramér’s V based on the Chi-square statistic.
    Measures the association between two categorical variables (0 = none, 1 = perfect).
    """
    contingency_table = pd.crosstab(df[col1], df[col2])  # contingency table of observed counts
    chi2 = stats.chi2_contingency(contingency_table)[0]
    n = contingency_table.sum().sum()

    # Normalize chi2
    phi2 = chi2 / n  # chi2 measures deviation from independence, and n = total sample size
    r, k = contingency_table.shape  # r = number of rows and k = number of columns

    # Bias correction (recommended for more robust results)
    with np.errstate(divide='ignore', invalid='ignore'):
        # Subtract bias to avoid inflating association
        phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
        # Adjust “effective categories” to reduce error due to small cells
        rcorr = r - ((r - 1) ** 2) / (n - 1)
        kcorr = k - ((k - 1) ** 2) / (n - 1)
        # Corrected formula -> robust V between 0 and 1
        v = np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

    return v
In [36]:
# RUN OUR ANALYSIS

print("--- EFFECT SIZE ON SALARY (Eta-squared) ---")
# Evaluate how much each predictor influences salary
for col in ['experience_level', 'company_size']:
    eta = compute_eta_squared(df_clean, col, 'salary_usd')
    print(f"Variable '{col}':")
    print(f"   -> η² = {eta:.4f} (Explains {eta*100:.1f}% of the salary variance)")

print("\n--- RELATIONSHIP BETWEEN PREDICTORS (Multicollinearity - Cramér's V) ---")
# Evaluate whether predictors provide redundant information
v_cramer = compute_cramers_v(df_clean, 'experience_level', 'company_size')
print("Association between 'experience_level' and 'company_size':")
print(f"   -> V = {v_cramer:.4f}")

# Automatic interpretation
if v_cramer < 0.3:
    print("   ✅ CONCLUSION: Low association. Predictors provide unique information (No multicollinearity risk).")
elif v_cramer < 0.7:
    print("   ⚠️ CONCLUSION: Moderate association. Watch for possible redundancy (Potential multicollinearity risk).")
else:
    print("   ❌ CONCLUSION: Very high association. Likely severe multicollinearity.")
--- EFFECT SIZE ON SALARY (Eta-squared) ---
Variable 'experience_level':
   -> η² = 0.6453 (Explains 64.5% of the salary variance)
Variable 'company_size':
   -> η² = 0.0308 (Explains 3.1% of the salary variance)

--- RELATIONSHIP BETWEEN PREDICTORS (Multicollinearity - Cramér's V) ---
Association between 'experience_level' and 'company_size':
   -> V = 0.0370
   ✅ CONCLUSION: Low association. Predictors provide unique information (No multicollinearity risk).

6. Predictive Modeling: Multiple Linear Regression¶


Once we have confirmed the statistical relevance of the predictors (experience_level and company_size) and ruled out multicollinearity, we proceed to build a Multiple Linear Regression (OLS) model.

The goal is to obtain a mathematical equation that allows us to estimate the expected salary ($Y$) as a function of experience level ($X_1$) and company size ($X_2$):

$$\text{Salary} = \beta_0 + \beta_1 \cdot (\text{Experience}) + \beta_2 \cdot (\text{Company Size}) + \epsilon$$

6.1 Variable Preparation: Dummy Encoding¶

To avoid imposing an artificial metric on our categorical variables (i.e., incorrectly assuming that the salary distance between Junior and Mid is the same as between Senior and Executive), we will use Dummy Variables (One-Hot Encoding).

This allows the model to estimate an independent coefficient (a “salary premium”) for each level, without forcing linearity constraints across categories.

The general equation we aim to estimate is:

$$Y = \beta_0 + \underbrace{\beta_1 D_{MI} + \beta_2 D_{SE} + \beta_3 D_{EX}}_{\text{Experience Levels}} + \underbrace{\beta_4 D_{M} + \beta_5 D_{L}}_{\text{Company Size}} + \epsilon$$

In [37]:
# VARIABLE PREPARATION (DUMMIES)

# NOTE:
# There are two common ways to do this in statsmodels:
# A) Using formulas (smf.ols): "salary_usd ~ C(experience_level) + C(company_size)"
#    More automatic because it creates dummies internally
# B) Using arrays (sm.OLS): requires creating dummies manually via pd.get_dummies
#    We'll use option (B) to keep full control over the columns and be explicit.

# Create dummy variables (one-hot encoding)
# drop_first=True removes one category as the reference (avoids perfect multicollinearity)
# dtype=int ensures 0/1 instead of True/False (avoids dtype issues in statsmodels)
X_dummies = pd.get_dummies(df_clean[['experience_level', 'company_size']], drop_first=True, dtype=int)

# Add constant (intercept) for OLS
X = sm.add_constant(X_dummies)

# Define target variable (y) and its log transform
y = df_clean['salary_usd']
y_log = np.log(df_clean['salary_usd'])  # log transform to correct heteroscedasticity

print("Variables ready. Model columns:", X.columns.tolist())
# Verify numeric types
print("\nData types in X:")
print(X.dtypes)
Variables ready. Model columns: ['const', 'experience_level_MI', 'experience_level_SE', 'experience_level_EX', 'company_size_M', 'company_size_L']

Data types in X:
const                  float64
experience_level_MI      int32
experience_level_SE      int32
experience_level_EX      int32
company_size_M           int32
company_size_L           int32
dtype: object

6.2 Model Selection: Linear vs. Logarithmic¶

To ensure the validity of our predictions, we must satisfy the Gauss–Markov assumptions, especially Homoscedasticity (the error variance should be constant).

First, we fit a baseline linear model on the raw salary target ($Y$) to diagnose residual behavior and evaluate whether transformations are needed.

In [38]:
# PLAIN LINEAR MODEL (diagnostic baseline)

# Fit the model without log transform
model_linear = sm.OLS(y, X).fit()

print("--- LINEAR MODEL DIAGNOSTICS (NO LOG) ---")
print(f"R-squared: {model_linear.rsquared:.4f}")
print(f"Log-Likelihood: {model_linear.llf:.2f}")
print(f"AIC: {model_linear.aic:.2f}")
print(f"BIC: {model_linear.bic:.2f}")

# Residual plot to detect heteroscedasticity
residuals_linear = model_linear.resid
plt.figure(figsize=(10, 6))
plt.scatter(model_linear.fittedvalues, residuals_linear, alpha=0.5, color='gray')
plt.title("Residual diagnostics (linear model)", fontsize=14)
plt.xlabel("Predicted salary", fontsize=12)
plt.ylabel("Residuals (error)", fontsize=12)
plt.axhline(0, color='red', linestyle='--', linewidth=2)
plt.show()
--- LINEAR MODEL DIAGNOSTICS (NO LOG) ---
R-squared: 0.6522
Log-Likelihood: -170192.26
AIC: 340396.52
BIC: 340442.01
[Figure: residuals vs. predicted salary for the linear model]

Heteroscedasticity Correction: Logarithmic Transformation¶

After inspecting the residuals of the linear model, we detect a “fan-shaped” pattern (heteroscedasticity): the errors increase as salary increases. To correct this, we apply a logarithmic transformation to the dependent variable ($\log(Y)$).

This changes the model interpretation: the coefficients ($\beta$) will represent percentage changes in salary rather than fixed monetary amounts, which better matches real-world labor market dynamics.

$$\log(\text{Salary}) = \beta_0 + \beta_1 D_{MI} + \beta_2 D_{SE} + \beta_3 D_{EX} + \beta_4 D_{M} + \beta_5 D_{L} + \epsilon$$

In [39]:
# LOG MODEL

# Fit the model using log salary
model_log = sm.OLS(y_log, X).fit()

# Show the full statistical summary
print(model_log.summary())

# ASSUMPTION DIAGNOSTICS (Gauss–Markov validation) ---
residuals = model_log.resid
fitted_vals = model_log.fittedvalues

fig, ax = plt.subplots(1, 2, figsize=(15, 6))

# -> Homoscedasticity (Residuals vs Fitted)
# We look for a shapeless "cloud". If there is no "fan" pattern, great!
sns.scatterplot(x=fitted_vals, y=residuals, ax=ax[0], alpha=0.6)
ax[0].axhline(0, color='red', linestyle='--', linewidth=2)
ax[0].set_title('Homoscedasticity', fontsize=12)
ax[0].set_xlabel('Predicted log(Salary)')
ax[0].set_ylabel('Residuals')

# -> Normality (Q-Q Plot)
# We look for blue points to follow the red line
sm.qqplot(residuals, line='45', fit=True, ax=ax[1])
ax[1].set_title('Normality (Q-Q Plot)', fontsize=12)

plt.tight_layout()
plt.show()

print("Interpretation:")
print("1. Left plot: if we see a dispersed cloud with no fan shape, we have corrected heteroscedasticity.")
print("2. Right plot: if the points align with the red line, residuals are approximately normally distributed.")
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             salary_usd   R-squared:                       0.679
Model:                            OLS   Adj. R-squared:                  0.678
Method:                 Least Squares   F-statistic:                     6127.
Date:                Wed, 14 Jan 2026   Prob (F-statistic):               0.00
Time:                        21:35:05   Log-Likelihood:                -1170.5
No. Observations:               14517   AIC:                             2353.
Df Residuals:                   14511   BIC:                             2399.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                  10.9016      0.005   2071.176      0.000      10.891      10.912
experience_level_MI     0.3288      0.006     54.258      0.000       0.317       0.341
experience_level_SE     0.6576      0.006    108.225      0.000       0.646       0.669
experience_level_EX     1.0173      0.006    161.743      0.000       1.005       1.030
company_size_M          0.0998      0.005     18.818      0.000       0.089       0.110
company_size_L          0.2345      0.005     43.853      0.000       0.224       0.245
==============================================================================
Omnibus:                     1233.878   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              396.771
Skew:                           0.067   Prob(JB):                     6.95e-87
Kurtosis:                       2.201   Cond. No.                         5.29
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[Figure: residuals-vs-fitted scatter and Q–Q plot for the log model]
Interpretation:
1. Left plot: if we see a dispersed cloud with no fan shape, we have corrected heteroscedasticity.
2. Right plot: if the points align with the red line, residuals are approximately normally distributed.
In [40]:
# ADDITIONAL RESIDUAL PLOTS (Histogram and Boxplot)
fig, ax = plt.subplots(1, 2, figsize=(15, 6))

# Histogram
sns.histplot(residuals, bins=30, kde=True, ax=ax[0])
ax[0].set_title("Residuals histogram", fontsize=12)
ax[0].set_xlabel("Error (residuals)")

# Boxplot (to see outliers)
sns.boxplot(x=residuals, ax=ax[1])
ax[1].set_title("Residuals boxplot", fontsize=12)
ax[1].set_xlabel("Error distribution")

plt.show()

# Shape metrics
print(f"Skewness: {residuals.skew():.4f}")
print(f"Kurtosis: {residuals.kurtosis():.4f}")
[Figure: histogram and boxplot of the log-model residuals]
Skewness: 0.0673
Kurtosis: -0.7985

6.3 Detailed Interpretation of Statistical Results¶

Below, we break down all the metrics returned by statsmodels and the residual plots to validate the model’s robustness and understand its impact on our study.

  • No. Observations: 14,517 → number of rows
  • Df Residuals: 14,511 (14,517 − 6 estimated parameters → 14,511 degrees of freedom)
  • Df Model: 5 (number of columns in X)
  • Covariance Type: nonrobust (heteroscedasticity-robust standard errors were not requested)

1. Goodness of Fit and Information Criteria¶

  • R-squared (0.679): The model explains 67.9% of the variance in log-salary. In practical terms, nearly 70% of salary differences are captured by Experience and Company Size alone.

  • Adj. R-squared (0.678): Practically identical to R², confirming that we did not include irrelevant predictors.

  • F-statistic (6127) and Prob (F-statistic) (0.00): The p-value is effectively zero (< 0.05). We can confidently conclude the model is not due to chance; the predictors have a real relationship with salary. The model is statistically significant overall.

  • Log-Likelihood (−1170.5), AIC (Akaike Information Criterion: 2353) and BIC (Bayesian Information Criterion: 2399):

    • These values measure model fit while penalizing complexity.
    • Compared to the linear model (AIC ~ 340,000), the drop to 2,353 is dramatic. Although the change of scale (log vs. linear) affects comparability, such a low AIC together with a less negative log-likelihood strongly suggests the log model captures the data structure far better than the raw linear model.

2. Economic Interpretation of the Coefficients ($\beta$)¶

Since this is a log-level model ($\log(Y) = \beta X$), coefficients are interpreted as approximate percentage changes relative to the baseline salary:

  • Intercept (const = 10.90): Baseline log value for an Entry-level employee in a Small company.

    • Translation: $e^{10.90} \approx \$54,176$. This is the estimated baseline salary.
  • experience_level (the main driver):

    • Mid-Level ($\beta \approx 0.33$): Salary increases by ~39% relative to the baseline ($e^{0.33} - 1$).
    • Senior ($\beta \approx 0.66$): Salary increases by ~93% relative to the baseline ($e^{0.66} - 1$).
    • Executive ($\beta \approx 1.02$): Salary increases by ~177% ($e^{1.02} - 1$), a substantial jump as expected at these organizational levels.
  • company_size (a secondary but meaningful effect):

    • Large ($\beta \approx 0.23$): Working at a large company implies a ~26% salary premium ($e^{0.23} - 1$) relative to a small one, holding experience constant.
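These percentage effects can be reproduced directly from the coefficients reported in the regression summary above (values copied from that output):

```python
import numpy as np

# Coefficients from the OLS summary of the log-level model
betas = {
    "experience_level_MI": 0.3288,
    "experience_level_SE": 0.6576,
    "experience_level_EX": 1.0173,
    "company_size_M":      0.0998,
    "company_size_L":      0.2345,
}

# In a log-level model, the exact percentage effect of a dummy is e^beta - 1
for name, b in betas.items():
    print(f"{name:22s}: {100 * (np.exp(b) - 1):6.1f}% vs. baseline")
```

For small coefficients (e.g., company_size_M ≈ 0.10) the coefficient itself is already a good approximation of the percentage change; for large ones (Executive ≈ 1.02) the exact $e^{\beta} - 1$ formula matters (≈177% rather than a naive 102%).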

3. Assumption Validation (Residual Diagnostics)¶

To confirm model reliability, we combine statistical tests with visual inspection of residual plots:

A) Homoscedasticity and Independence

  • Durbin–Watson (1.986): Near the ideal value (~2), indicating no autocorrelation in the errors.
  • Cond. No. (Condition Number) (5.29): Assesses numerical multicollinearity; values above 30 typically signal a problem. At 5.29, well below that threshold, the design is excellent, confirming what we saw with Cramér’s V: the predictors are independent and the model is stable.
  • Residuals vs. Fitted scatter plot: The point cloud appears stable and evenly spread along the horizontal axis. We have successfully removed the “fan-shaped” pattern present in the linear model, satisfying the constant-variance assumption.

B) Residual Normality

  • Numeric tests (Jarque–Bera (JB) = 396, Prob = 0.00 and Omnibus = 1233, Prob = 0.00): These reject perfect normality (Prob < 0.05), largely because the sample is very large ($N > 14,000$), making the tests highly sensitive to small deviations.

  • Q–Q plot: Excellent fit in the center (where most observations lie). Deviations in the tails indicate the model is highly accurate for typical salaries but less precise for extremely high or low salaries.

  • Histogram and boxplot:

    • The histogram shows a fairly symmetric bell shape (Skew ≈ 0.06), indicating errors are not strongly biased to one side.
    • The boxplot reveals some isolated outliers, explaining why Jarque–Bera rejects strict normality, even though the bulk of the distribution behaves well.
  • Skew (0.067): Indicates near-perfect symmetry (very close to 0). Errors are evenly distributed around zero, which is excellent.

  • Kurtosis (2.201): A normal distribution has kurtosis 3. A value of 2.2 indicates a platykurtic distribution (flatter than normal, with somewhat lighter tails overall), although some tail outliers still exist.
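The summary’s Kurtosis (2.201) and the pandas figure printed earlier (−0.7985) describe the same shape under two conventions: statsmodels reports Pearson kurtosis (normal = 3), while pandas `.kurtosis()` reports excess kurtosis (normal = 0), and 2.201 − 3 ≈ −0.80. A quick check on simulated normal data:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
r = pd.Series(rng.normal(size=100_000))

# Pearson kurtosis (statsmodels convention): normal distribution -> 3
print(f"Pearson kurtosis: {stats.kurtosis(r, fisher=False):.3f}")
# Excess kurtosis (pandas convention): normal distribution -> 0
print(f"Excess kurtosis:  {r.kurtosis():.3f}")
```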

Validation Conclusion: The model is robust. Although some outliers generate deviations in the tails, the symmetry and homoscedasticity in the core of the data ensure that coefficient estimates and confidence intervals are valid for the vast majority of the population.

Modeling Conclusions¶

After the full fitting and validation process, we conclude:

  1. Valid model: Despite mild non-normality in the tails (highlighted by Jarque–Bera and the boxplot), the model satisfies the critical assumptions of homoscedasticity, linearity, and error symmetry. It is a reliable estimation tool.
  2. Factor hierarchy: Experience is statistically proven to be the dominant driver of salary, with exponential jumps at Senior/Executive levels. Company size acts as a secondary but significant multiplier.
  3. Predictive capability: With an $R^2$ close to 68%, the model provides highly accurate salary-range estimates for most of the market, although caution is needed for extreme cases (detected outliers).

6.4 Prediction and Confidence Intervals¶

Finally, we use the optimized logarithmic model to generate predictions. Since a single point prediction is not sufficient for decision-making, we will quantify the associated uncertainty in two complementary ways:

  1. Analytical interval: The classical computation of a 95% prediction interval for a new observation with a specific profile (wider than the confidence interval for the mean salary of that profile).
  2. Robust validation (resampling): We will simulate 2,000 random train/test splits (Monte Carlo cross-validation, loosely referred to as bootstrapping here) to estimate the model’s true Root Mean Squared Error (RMSE). This will tell us, with 95% confidence, how far off our model is on average when predicting the salary of a new employee.
In [41]:
# A) PREDICTION FOR A HYPOTHETICAL CASE

# Let's imagine a new case: a 'Senior' (SE) in a 'Large' (L) company
new_employee = pd.DataFrame(0.0, index=[0], columns=X.columns)  # all predictors at 0; float dtype avoids object-dtype issues in statsmodels
new_employee['const'] = 1.0

# Activate the corresponding dummy variables
if 'experience_level_SE' in X.columns:
    new_employee['experience_level_SE'] = 1
if 'company_size_L' in X.columns:
    new_employee['company_size_L'] = 1

# Prediction and inverse log-transform (np.exp)
pred_log = model_log.predict(new_employee)[0]
pred_usd = np.exp(pred_log)

# Analytical 95% interval: obs_ci_* is the *prediction* interval for a new
# observation (wider than the confidence interval for the mean)
pred_summary = model_log.get_prediction(new_employee).summary_frame(alpha=0.05)
ci_lower = np.exp(pred_summary['obs_ci_lower'].iloc[0])
ci_upper = np.exp(pred_summary['obs_ci_upper'].iloc[0])

print("--- CASE PREDICTION: Senior in a Large Company ---")
print(f"Estimated salary: ${pred_usd:,.2f}")
print(f"Interval (95%): [${ci_lower:,.2f} - ${ci_upper:,.2f}]")


# B) ROBUST VALIDATION (BOOTSTRAPPING)

rmse_list = []
n_iter = 2000

# Set a seed for reproducibility
np.random.seed(24)

print(f"\nRunning bootstrapping ({n_iter} iterations)...")

for _ in tqdm(range(n_iter)):
    # Split into train/test (randomness is controlled by the fixed seed)
    X_train, X_test, y_train, y_test = train_test_split(X, y_log, test_size=0.2)  # note: no random_state here

    # Train and predict
    m_boot = sm.OLS(y_train, X_train).fit()  # model is fitted here
    preds_log = m_boot.predict(X_test)

    # Convert back to real USD to compute the true error
    rmse = np.sqrt(mean_squared_error(np.exp(y_test), np.exp(preds_log)))
    rmse_list.append(rmse)

# Compute mean and percentiles
rmse_mean = np.mean(rmse_list)
rmse_lower = np.percentile(rmse_list, 2.5)
rmse_upper = np.percentile(rmse_list, 97.5)

print(f"Global mean RMSE: ${rmse_mean:,.2f}")
print(f"Error range (95%): [${rmse_lower:,.0f} - ${rmse_upper:,.0f}]")
--- CASE PREDICTION: Senior in a Large Company ---
Estimated salary: $132,412.64
Interval (95%): [$79,169.73 - $221,462.28]

Running bootstrapping (2000 iterations)...
100%|██████████| 2000/2000 [00:18<00:00, 108.63it/s]
Global mean RMSE: $30,146.95
Error range (95%): [$29,360 - $30,976]

6.5 3D Visualization (Error Visualization)¶

To visually inspect prediction performance, we build a 3D plot comparing:

  • Experience level (x-axis)
  • Company size (y-axis)
  • Actual vs predicted salary (z-axis)

This helps identify where errors concentrate (e.g., at the extremes).

In [42]:
# Prepare a sample and create numeric codes on the fly (also set a seed)
# We use sample(500) to visualize density without overcrowding the plot
df_viz = df_clean.sample(500, random_state=24).copy()

# Since we defined the logical ordering (Entry -> Executive) at the beginning of the analysis,
# we can use .cat.codes to automatically respect that ordering
df_viz['exp_code_viz'] = df_viz['experience_level'].cat.codes
df_viz['size_code_viz'] = df_viz['company_size'].cat.codes

# Generate predictions for this sample
# Recreate the dummy variables for the model
X_viz = pd.get_dummies(df_viz[['experience_level', 'company_size']], drop_first=True, dtype=int)
X_viz = sm.add_constant(X_viz)

# Fill missing columns with 0 (in case the sample is missing a category)
for col in X.columns:
    if col not in X_viz.columns:
        X_viz[col] = 0
X_viz = X_viz[X.columns]  # reorder columns to match the original model

# Predict
pred_log_viz = model_log.predict(X_viz)
df_viz['pred_usd'] = np.exp(pred_log_viz)

# Build the INTERACTIVE 3D PLOT
fig = go.Figure()

# A) Actual points (Blue)
fig.add_trace(go.Scatter3d(
    x=df_viz['exp_code_viz'],
    y=df_viz['size_code_viz'],
    z=df_viz['salary_usd'],
    mode='markers',
    marker=dict(size=5, color='blue', opacity=0.7),
    name="Actual Data"
))

# B) Predicted points (Orange)
fig.add_trace(go.Scatter3d(
    x=df_viz['exp_code_viz'],
    y=df_viz['size_code_viz'],
    z=df_viz['pred_usd'],
    mode='markers',
    marker=dict(size=5, color='orange', opacity=0.9, symbol='diamond'),
    name="Model Prediction"
))

# C) Residual lines (Red)
# A single None-separated trace draws all 500 segments at once
# (much faster than adding one trace per residual)
xs, ys, zs = [], [], []
for x_real, y_real, z_real, z_pred in zip(
    df_viz['exp_code_viz'],
    df_viz['size_code_viz'],
    df_viz['salary_usd'],
    df_viz['pred_usd']
):
    xs += [x_real, x_real, None]
    ys += [y_real, y_real, None]
    zs += [z_real, z_pred, None]

fig.add_trace(go.Scatter3d(
    x=xs, y=ys, z=zs,
    mode='lines',
    line=dict(color='red', width=2),
    showlegend=False  # hide from the legend
))

# Final layout configuration with axis text labels
fig.update_layout(
    title="Prediction Errors: Actual vs Model (Sample of 500 cases)",
    scene=dict(
        xaxis=dict(title="Experience Level", tickvals=[0, 1, 2, 3], ticktext=["Entry", "Mid", "Senior", "Executive"]),
        yaxis=dict(title="Company Size", tickvals=[0, 1, 2], ticktext=["Small", "Medium", "Large"]),
        zaxis_title="Salary ($)"
    ),
    width=900,
    height=700
)

fig.show()

3D Plot Interpretation: The blue points represent the actual observed values, and the orange points represent the model’s predictions.

  • “Stair-step” effect: Since the predictors are categorical, the model outputs a fixed predicted value for each category combination (e.g., Senior + Large). Visually, this creates vertical “columns” where the blue points (real salaries) cluster around the orange points (predicted salaries).
  • Residuals (Red lines): The length of each red line represents the individual prediction error. We can see the model performs much better for low and mid salary ranges (short lines), but the error grows substantially for executive profiles (longer red lines), confirming the presence of outliers and higher variance among the highest salaries.

6.6 Model Uncertainty Analysis¶

Finally, we quantify predictive uncertainty using the 2,000 RMSE values produced by the repeated train/test splits above: each split refits the model and scores it on held-out points.

This provides a distribution of the expected error and a robust 95% uncertainty range for model performance.

In [43]:
# Convert the bootstrapped error list to a DataFrame
errors_df = pd.DataFrame({"rmse": rmse_list})

# RMSE histogram
plt.figure(figsize=(10, 6))
errors_df['rmse'].hist(bins=30, edgecolor='black', alpha=0.7)
plt.title("Distribution of RMSE across 2000 simulations")
plt.xlabel("RMSE in dollars ($)")
plt.ylabel("Frequency")
plt.axvline(errors_df['rmse'].mean(), color='red', linestyle='dashed', linewidth=2, label='Mean')
plt.legend()
plt.show()

# Distribution statistics
print(f"Skewness of error: {errors_df['rmse'].skew():.4f}")
print(f"Kurtosis of error: {errors_df['rmse'].kurtosis():.4f}")

# Confidence interval
mean_error = errors_df['rmse'].mean()
std_error = errors_df['rmse'].std()

lower = np.percentile(errors_df['rmse'], 2.5)
upper = np.percentile(errors_df['rmse'], 97.5)

print("\n--- FINAL ACCURACY CONCLUSION (Percentile method) ---")
print(f"The model's mean RMSE is ${mean_error:,.0f}.")
print("With 95% confidence, the prediction error will lie within:")
print(f"[{lower:,.0f}, {upper:,.0f}]")
print(f"This shows high model stability, with a standard deviation of only ${std_error:,.0f} across simulations.")
[Figure: histogram of RMSE across the 2,000 simulations]
Skewness of error: 0.0438
Kurtosis of error: 0.0910

--- FINAL ACCURACY CONCLUSION (Percentile method) ---
The model's mean RMSE is $30,147.
With 95% confidence, the prediction error will lie within:
[29,360, 30,976]
This shows high model stability, with a standard deviation of only $412 across simulations.

7. Final Conclusions¶


After conducting an exhaustive statistical analysis on a sample of more than 14,000 records from the AI job market in 2025, we arrive at the following conclusions, grounded in data-driven evidence and predictive modeling:

1. Hierarchy of Salary Drivers¶

Both the association analysis and the regression model converge on a key finding:

  • Experience is the primary driver: Experience level (experience_level) alone explains 64.5% of salary variability ($\eta^2 \approx 0.65$). The salary jump from Mid-Level to Senior is substantial, and it becomes even more pronounced at Executive levels.
  • Company size is secondary: Although the effect is statistically significant, company size (company_size) explains only about 3% of the variability ($\eta^2 \approx 0.03$). While large companies (L) tend to pay more than small ones (S), this factor is far less decisive than a candidate’s experience.

2. Market Behavior¶

  • Positive skewness: Salary distributions are not normal. There is a strong concentration in low-to-mid ranges and a long right tail of extremely high salaries—typical of a fast-growing technology sector.
  • Uneven dispersion across segments: Beyond mean differences, salary variability differs across categories (Levene’s test). This points to wider salary bands in certain segments (e.g., negotiation power, specialization, location, bonuses), reinforcing the need to report intervals and rely on robust methods.
  • Independence of factors: Using Cramér’s V (0.03), we show there is no meaningful association between company size and experience level. Small firms also hire Senior profiles, and large firms also hire Juniors—without a strong structural bias.

3. Reliability of the Predictive Model¶

For salary prediction, we discarded the simple linear model due to heteroscedasticity and opted for a log-linear model (OLS on log-salary), yielding robust results:

  • Explanatory power: The model achieves an $R^2$ of 67.9% (on log-salary), explaining roughly two-thirds of salary variation with only these two variables.

  • Accuracy: Through repeated random train/test validation (2,000 iterations), we estimate the model’s average out-of-sample error (RMSE) at approximately $30,147 USD.

    • Interpretation: With an average salary around $115k, a $30k error is acceptable for general market estimation, though it indicates that additional variables (e.g., specific skills or geographic location) likely explain the remaining ~32% of variance.

4. Executive Summary¶

With 95% statistical confidence, we conclude that the most effective strategy to maximize salary in the AI sector is to prioritize vertical growth (experience level) over company type. After the logarithmic transformation, the model satisfies the Gauss–Markov assumptions, making it a valid tool for estimating fair salary ranges based on seniority and corporate context—excluding extreme outliers.